I think at this point I have a pretty solid understanding of when PLS-DA works well, and when other methods work as well or better.
Here’s what I can say so far (further explanation and plots follow):
PCA is really terrible at finding a needle in a haystack. That is, if the variables that separate your groups are a small proportion of the total variables, PCA will not show separation between groups, especially if those discriminating variables don’t covary with each other. The advice given by Worley and Powers to validate PLS-DA with PCA is not good advice.
Even if a PLS-DA is terrible, it will show some separation in a score plot. Authors should NOT publish plots of bad PLS-DA models. Maybe they shouldn’t even publish plots of good models, since they don’t look that different from plots of bad models. Bi-plots or score plots side-by-side with loading plots could be good. Maybe if you have more than 2 groups, score plots could be informative. Otherwise, they are misleading.
PERMANOVA seems to always give lower p-values than PLS-DA, with the exception of an extreme needle-in-a-haystack scenario—that is, many correlated variables that are not correlated with group membership and just a few variables that differ among groups.
library(chemhelper)
#this is a package I wrote which contains a function to simulate multivariate datasets with different pieces. Install with devtools::install_github("Aariq/chemhelper")
library(tidyverse)
library(ropls)
library(vegan)
library(cowplot)
library(broom) #for tidy() which turns model output into data frames.
library(purrr)
library(iheatmapr) #you can also use heatmap(), but I like the iheatmap() colors better, plus it's interactive
If you’re interested in how I’m simulating multivariate datasets check the help file or code for chemhelper::sim_multvar()
#not run
?sim_multvar
View(sim_multvar)
Briefly:
p_uncorvar: How many uncorrelated variables?
p_corvar: How many correlated variables?
p_discr: How many discriminating variables?
cov_corvar: What covariance to use for the correlated variables?
cov_discr: What covariance to use for the discriminating variables?
diff_discr: How different are the two groups for the discriminating variables?
N: How many observations/rows?

When discriminating variables are a minority of the variables and there are many co-varying variables not correlated to group membership, PCA does a really bad job at finding separation. This is especially true when there are fewer observations than variables.
data1 <- sim_multvar(p_uncorvar = 20,
                     p_corvar = 15,
                     p_discr = 5,
                     cov_corvar = 0.5,
                     cov_discr = 0.3,
                     diff_discr = 2,
                     N = 20,
                     seed = 222)
data1 %>%
select(-group) %>%
cor() %>%
iheatmap(row_labels = TRUE, col_labels = TRUE, name = "cor")
The discriminating variables are a tiny fraction in the lower left corner.
pca1 <- opls(select(data1, -group), plot = FALSE)
## PCA
## 20 samples x 40 variables
## standard scaling of predictors
## R2X(cum) pre ort
## Total 0.568 5 0
pca1_plot <- plot_pca(pca1, data1$group)
plsda1 <- opls(select(data1, -group), data1$group, plot = FALSE, permI = 200)
## PLS-DA
## 20 samples x 40 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.28 0.945 0.721 0.127 2 0 0.02 0.005
plsda1_plot <- plot_plsda(plsda1)
plot_grid(pca1_plot + theme(legend.position = "top"),
          plsda1_plot + theme(legend.position = "top"),
          ncol = 2, nrow = 1)
PLS-DA clearly does better at finding separation, and we can extract the variables responsible with VIP scores.
It correctly identifies the discriminating variables as the top VIPs. Some of the “uncorrelated” and “correlated” variables have strong correlations with discriminating variables by chance.
get_VIP(plsda1) %>%
filter(VIP > 1) %>%
arrange(desc(VIP))
data2 <- sim_multvar(p_uncorvar = 20,
                     p_corvar = 20,
                     p_discr = 0,
                     cov_corvar = 0.5,
                     N = 20,
                     seed = 222)
data2 %>%
select(-group) %>%
cor() %>%
iheatmap(row_labels = TRUE, col_labels = TRUE, name = "cor")
plsda2 <- opls(select(data2, -group), data2$group,
               permI = 200,
               plot = FALSE,
               predI = 2)
## PLS-DA
## 20 samples x 40 variables and 1 response
## standard scaling of predictors and response(s)
## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
## Total 0.334 0.781 -0.0404 0.254 2 0 0.475 0.47
#this is a terrible model, so I have to force it to produce two axes with the predI argument
Clearly this is a bad model. The \(Q^2\) value is negative, and both p-values are very high.
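Those p-values (pR2Y and pQ2) come from permutation testing. Here is a rough sketch of my understanding of what the permI argument does behind the scenes—this is not ropls's exact code, and ropls may count the permutations slightly differently, but the idea is: shuffle the class labels, refit, and see how often chance does as well as the real model. getSummaryDF() is the ropls accessor for the model statistics table.

```r
#Sketch (assumption about mechanics): refit the model on shuffled class labels
#and ask how often the shuffled Q2 matches or beats the real one.
real_q2 <- getSummaryDF(plsda2)[["Q2(cum)"]]
perm_q2 <- replicate(200, {
  m <- opls(select(data2, -group), sample(data2$group),
            predI = 2, permI = 0, plot = FALSE) #predI = 2 so bad shuffled models still fit
  getSummaryDF(m)[["Q2(cum)"]]
})
mean(perm_q2 >= real_q2) #roughly the pQ2 reported above
```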
But the plot still shows some separation.
plot_plsda(plsda2)
In my opinion, it would be extremely misleading to publish this plot, especially if it did not have the \(R^2\) and \(Q^2\) values on it.
I think in general it may be a bad idea to show PLS-DA score plots, since good models don’t look that different from bad models. One situation where a score plot might be useful (if it’s a good model) is plotted next to the loadings.
plsda_scores_plot <- plot_plsda(plsda1)
loadingdata <- get_loadings(plsda1)
plsda_loading_plot <- ggplot(loadingdata, aes(x = p1, y = p2)) +
  geom_text(aes(label = Variable, alpha = abs(p1))) +
  xlim(-0.5, 0.5) +
  ylim(-0.5, 0.5) +
  theme_bw() +
  scale_alpha("P1 axis correlation") +
  ggtitle("Axis Loadings") +
  labs(caption = "")
plot_grid(plsda_scores_plot + theme(legend.position = "top"),
          plsda_loading_plot + theme(legend.position = "top"),
          ncol = 2, nrow = 1, align = "h")
Pros for score plots:
- Plotted next to a loading plot (or as a bi-plot), they show which variables drive the separation.
- With more than 2 groups, they can show which groups separate from which.

Cons for score plots:
- Even a terrible model shows some separation, so score plots of good and bad models look alike.
- Without the \(R^2\), \(Q^2\), and permutation p-values shown alongside, readers can’t judge model quality from the plot.
In playing around with simulated datasets, I tested out PERMANOVA, a non-parametric extension of MANOVA that can handle more variables than samples. At first, PERMANOVA seemed to always outperform PLS-DA in finding real differences between groups. But it doesn’t do well if there are many correlated variables that are not related to group structure. This makes sense since MANOVA is analogous to ANOVA and uses ratios of within- and among-group covariances. I also tested t-tests on PCA axes, which performed terribly. I feel like I’ve seen this in the literature a couple of times, but I’m not sure how common it really is.
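For a single dataset, the test is one call to vegan's adonis(). For example, on data1 from above (this mirrors the calls used in the simulation loops below):

```r
#One PERMANOVA on data1: Euclidean distances, permutation p-value for the group effect
permanova1 <- adonis(select(data1, -group) ~ group, data = data1, method = "eu")
permanova1$aov.tab             #MANOVA-style table with pseudo-F and Pr(>F)
permanova1$aov.tab$`Pr(>F)`[1] #just the p-value
```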
These data represent a situation where PCA might actually be fine: many discriminating variables with high covariance.
nperm <- 50
sim3a <- map(1:nperm,
             ~sim_multvar(p_uncorvar = 5,
                          p_corvar = 5,
                          p_discr = 30,
                          cov_corvar = 0.3,
                          cov_discr = 0.5,
                          N = 20,
                          seed = NA))
sim3a[[1]] %>%
select(-group) %>%
cor() %>%
iheatmap(row_labels = TRUE, col_labels = TRUE, name = "cor")
Runs a PCA on all simulated datasets and does a t-test using PC1 scores. Extracts p-values
pca.ttests3a <-
  map(sim3a,
      ~opls(select(., -group),
            plotL = FALSE)) %>%
  compact() %>%
  map(~get_plotdata(.) %>% .$plot_data) %>%
  #maps t.test() over all data frames and tidies the output into one data frame
  #(the group assignment is identical in every simulated dataset, so sim3a[[1]]$group works for all of them)
  map_dfr(~t.test(.$p1 ~ sim3a[[1]]$group) %>% tidy(), .id = "dataset") %>%
  #selects just the columns of interest
  select(dataset,
         #PC1.effect.size = "estimate",
         #PC1.t = "statistic",
         PC1.p.value = "p.value")
Does a PERMANOVA using Euclidean distances and extracts the p-value.
#ugly code that extracts p-values from PERMANOVA results tables
permanova.p3a <- map_dbl(sim3a,
                         ~adonis(select(., -group) ~ .$group,
                                 method = "eu")$aov.tab$`Pr(>F)`[1]) %>%
  tibble(permanova.p = .)
Run PLS-DA on all. Force 2 axes because every once in a while, a random dataset will produce a bad model that breaks the code. Extracts \(p_{Q^2}\).
plsda3a.list <- map(sim3a,
                    ~opls(select(., -group), .$group,
                          predI = 2,
                          permI = 100,
                          plotL = FALSE))
#Extract p-values and other model stats
PLS.p3a <- map_df(plsda3a.list, ~get_plotdata(.)$model_stats) %>%
select(pQ2)
summary3a <- bind_cols(pca.ttests3a, permanova.p3a, PLS.p3a)
distrplot.df3a <- summary3a %>%
gather(-dataset, key = "method", value = "p.value")
ggplot(distrplot.df3a, aes(x = p.value, fill = method)) +
geom_density(alpha = 0.3) +
geom_vline(aes(xintercept = 0.05), linetype = 2) +
ggtitle("Lots of co-varying, discriminating variables")
In this situation, PLS-DA actually does a pretty awful job. I don’t really understand why. PCA and PERMANOVA are about equivalent.
These are the same parameters as the first example above (data1).
nperm <- 50
sim3b <- map(1:nperm,
             ~sim_multvar(p_uncorvar = 20,
                          p_corvar = 15,
                          p_discr = 5,
                          cov_corvar = 0.5,
                          cov_discr = 0.3,
                          diff_discr = 2,
                          N = 20,
                          seed = NA))
sim3b[[1]] %>%
select(-group) %>%
cor() %>%
iheatmap(row_labels = TRUE, col_labels = TRUE, name = "cor")
Runs a PCA on all simulated datasets and does a t-test using PC1 scores. Extracts p-values
pca.ttests3b <-
  map(sim3b,
      ~opls(select(., -group),
            plotL = FALSE)) %>%
  compact() %>%
  map(~get_plotdata(.) %>% .$plot_data) %>%
  #maps t.test() over all data frames and tidies the output into one data frame
  #(the group assignment is identical in every simulated dataset, so sim3b[[1]]$group works for all of them)
  map_dfr(~t.test(.$p1 ~ sim3b[[1]]$group) %>% tidy(), .id = "dataset") %>%
  #selects just the columns of interest
  select(dataset,
         #PC1.effect.size = "estimate",
         #PC1.t = "statistic",
         PC1.p.value = "p.value")
Does a PERMANOVA using Euclidean distances and extracts the p-value.
#ugly code that extracts p-values from PERMANOVA results tables
permanova.p3b <- map_dbl(sim3b,
                         ~adonis(select(., -group) ~ .$group,
                                 method = "eu")$aov.tab$`Pr(>F)`[1]) %>%
  tibble(permanova.p = .)
Run PLS-DA on all. Force 2 axes because every once in a while, a random dataset will produce a bad model that breaks the code. Extracts \(p_{Q^2}\).
plsda3b.list <- map(sim3b,
                    ~opls(select(., -group), .$group,
                          predI = 2,
                          permI = 100,
                          plotL = FALSE))
#Extract p-values and other model stats
PLS.p3b <- map_df(plsda3b.list, ~get_plotdata(.)$model_stats) %>%
select(pQ2)
summary3b <- bind_cols(pca.ttests3b, permanova.p3b, PLS.p3b)
distrplot.df3b <- summary3b %>%
gather(-dataset, key = "method", value = "p.value")
ggplot(distrplot.df3b, aes(x = p.value, fill = method)) +
geom_density(alpha = 0.3) +
geom_vline(aes(xintercept = 0.05), linetype = 2) +
ggtitle("Needle-in-Haystack") +
coord_cartesian(xlim = c(0, 0.25), ylim = c(0, 80))
In this situation, PCA performs terribly, PLS-DA does alright, and PERMANOVA does freakishly well.
This is the situation where I found that PLS-DA actually outperforms PERMANOVA. There are a lot of correlated variables, with stronger correlations than in 3b, and the difference between the means of the two groups is smaller.
nperm <- 50
sim3c <- map(1:nperm,
             ~sim_multvar(p_uncorvar = 5,
                          p_corvar = 30,
                          p_discr = 5,
                          cov_corvar = 0.7,
                          cov_discr = 0,
                          diff_discr = 1.2,
                          N = 20,
                          seed = NA))
sim3c[[1]] %>%
select(-group) %>%
cor() %>%
iheatmap(row_labels = TRUE, col_labels = TRUE, name = "cor")
I must admit, this is kind of ridiculous.
Runs a PCA on all simulated datasets and does a t-test using PC1 scores. Extracts p-values
pca.ttests3c <-
  map(sim3c,
      ~opls(select(., -group),
            plotL = FALSE)) %>%
  compact() %>%
  map(~get_plotdata(.) %>% .$plot_data) %>%
  #maps t.test() over all data frames and tidies the output into one data frame
  #(the group assignment is identical in every simulated dataset, so sim3c[[1]]$group works for all of them)
  map_dfr(~t.test(.$p1 ~ sim3c[[1]]$group) %>% tidy(), .id = "dataset") %>%
  #selects just the columns of interest
  select(dataset,
         #PC1.effect.size = "estimate",
         #PC1.t = "statistic",
         PC1.p.value = "p.value")
Does a PERMANOVA using Euclidean distances and extracts the p-value.
#ugly code that extracts p-values from PERMANOVA results tables
permanova.p3c <- map_dbl(sim3c,
                         ~adonis(select(., -group) ~ .$group,
                                 method = "eu")$aov.tab$`Pr(>F)`[1]) %>%
  tibble(permanova.p = .)
Run PLS-DA on all. Force 2 axes because every once in a while, a random dataset will produce a bad model that breaks the code. Extracts \(p_{Q^2}\).
plsda3c.list <- map(sim3c,
                    ~opls(select(., -group), .$group,
                          predI = 2,
                          permI = 100,
                          plotL = FALSE))
#Extract p-values and other model stats
PLS.p3c <- map_df(plsda3c.list, ~get_plotdata(.)$model_stats) %>%
select(pQ2)
summary3c <- bind_cols(pca.ttests3c, permanova.p3c, PLS.p3c)
distrplot.df3c <- summary3c %>%
gather(-dataset, key = "method", value = "p.value")
ggplot(distrplot.df3c, aes(x = p.value, fill = method)) +
geom_density(alpha = 0.3) +
geom_vline(aes(xintercept = 0.05), linetype = 2) +
ggtitle("EXTREME Needle-in-Haystack")
And now we see where PLS-DA outperforms PERMANOVA. Again, I’m not entirely sure why, but I think it’s because the test statistic that PERMANOVA calculates is a ratio of among- to within-group covariances. So if covariance is high within groups and not between groups, F is small, and p is large. But PLS-DA doesn’t seem to care as much.
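To make that intuition concrete, here's a hand-rolled sketch of PERMANOVA's pseudo-F, computed directly from Euclidean distances (my reading of Anderson's distance-based formulation; adonis() computes the same ratio before permuting rows to get a p-value):

```r
#Pseudo-F by hand: among-group vs. within-group sums of squared distances.
#High within-group covariance inflates SS_within, shrinking F even when the
#group means differ.
pseudo_F <- function(X, group) {
  d2 <- as.matrix(dist(X))^2 #squared Euclidean distances
  N  <- nrow(X)
  a  <- length(unique(group))
  SS_total  <- sum(d2[lower.tri(d2)]) / N
  SS_within <- sum(sapply(split(seq_len(N), group), function(idx) {
    dg <- d2[idx, idx, drop = FALSE]
    sum(dg[lower.tri(dg)]) / length(idx)
  }))
  SS_among <- SS_total - SS_within
  (SS_among / (a - 1)) / (SS_within / (N - a))
}
pseudo_F(select(data1, -group), data1$group)
```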
And just to double-check the PLS-DA: the \(Q^2\) values are also generally acceptable in this circumstance.
map_df(plsda3c.list, ~get_plotdata(.)$model_stats) %>%
select(-pre, -ort) %>% #these are just the number of predictive and orthogonal axes
summarise_all(mean)
map_df(plsda3c.list, ~get_plotdata(.)$model_stats) %>%
ggplot(aes(x = `Q2(cum)`)) +
geom_density()